To install the Natural Language Toolkit (NLTK), type the following in a terminal:

sudo pip install -U nltk

You'll need the gensim package too:

pip install -U gensim

This notebook also uses the textblob and fuzzy packages:

sudo pip install textblob

sudo pip install fuzzy


In [1]:
import nltk 
#nltk.download()  # uncomment and run once to fetch corpora such as wordnet and stopwords


Text processing steps:

  • Noise Removal
  • Lexicon Normalization
  • Object Standardization

Text processing pipeline

  • 1) Raw Text
  • 2) (Noisy Entities Removal) Stopwords, URLs, punctuation, mentions, etc.
  • 3) (Word Normalization) Tokenization, Lemmatization, Stemming
  • 4) (Word Standardization) Regular expressions, Lookup tables
  • 5) Cleaned text

2.1) Noise Removal: stopwords, URLs, punctuation, mentions, etc.


In [4]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

In [5]:
_remove_noise("this is a sample text")


Out[5]:
'sample text'
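
The hard-coded noise_list above is only for illustration. A more realistic sketch uses NLTK's built-in English stopword list (assuming the stopwords corpus has been fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def _remove_stopwords(input_text):
    # keep only the words that are not in NLTK's English stopword list
    words = input_text.split()
    return " ".join(word for word in words if word.lower() not in stop_words)

_remove_stopwords("this is a sample text")  # -> 'sample text'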

In [6]:
# Sample code to remove a regex pattern 
import re

In [7]:
def _remove_regex(input_text, regex_pattern):
    # strip every match of the pattern from the text;
    # str.replace avoids re-interpreting the matched text as a regex
    for match in re.finditer(regex_pattern, input_text):
        input_text = input_text.replace(match.group(), '')
    return input_text

In [8]:
regex_pattern = r"#[\w]*"  # raw string so the escape sequence reaches the regex engine intact

In [9]:
_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)


Out[9]:
'remove this  from analytics vidhya'
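
The same helper can strip the other noisy entities named in the pipeline, e.g. URLs and @-mentions. A minimal sketch (the patterns below are illustrative, not exhaustive):

url_pattern = r"https?://\S+"
mention_pattern = r"@[\w]*"

text = "read this @user http://example.com/post now"
text = _remove_regex(text, url_pattern)      # drops the URL
text = _remove_regex(text, mention_pattern)  # drops the mention
# leftover double spaces can be collapsed with " ".join(text.split())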

2.2) Lexicon Normalization

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they differ in form, contextually they are all similar. This step converts all such variants of a word into their normalized form (also known as the lemma).

  • Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

  • Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammar relations).


In [11]:
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer

In [12]:
lem = WordNetLemmatizer()
stem = PorterStemmer()

In [13]:
word = "multiplying"

In [14]:
lem.lemmatize(word, "v")


Out[14]:
'multiply'

In [15]:
stem.stem(word)


Out[15]:
'multipli'
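
Note the difference: the lemmatizer returns the dictionary form 'multiply', while the rule-based stemmer merely strips the suffix and leaves the non-word 'multipli'.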

2.3) Object Standardization

Text data often contains words or phrases that are not present in standard lexical dictionaries, and such pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup to replace social media slang in a text.


In [17]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}

In [24]:
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        # swap the word for its expansion when it appears in the lookup dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

In [25]:
_lookup_words("RT this is a retweeted tweet by Shivam Bansal")


Out[25]:
'Retweet this is a retweeted tweet by Shivam Bansal'

3) Text to Features (Feature Engineering on Text Data)


In [26]:
from nltk import word_tokenize, pos_tag

In [27]:
text = "I am learning Natural Language Processing on Analytics Vidhya"

In [28]:
tokens = word_tokenize(text)


In [30]:
pos_tag(tokens)


Out[30]:
[('I', 'PRP'),
 ('am', 'VBP'),
 ('learning', 'VBG'),
 ('Natural', 'NNP'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('on', 'IN'),
 ('Analytics', 'NNP'),
 ('Vidhya', 'NNP')]
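
These are Penn Treebank tags: PRP is a personal pronoun, VBP a non-3rd-person present-tense verb, VBG a gerund/present participle, NNP a proper noun, and IN a preposition.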


In [32]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

In [33]:
doc_complete = [doc1, doc2, doc3]

In [34]:
doc_clean = [doc.split() for doc in doc_complete]

In [43]:
import gensim 

from gensim import corpora

In [44]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

In [45]:
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [46]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [47]:
# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [48]:
# Results 
print(ldamodel.print_topics())


[(0, '0.029*"sister" + 0.029*"my" + 0.029*"My" + 0.029*"to" + 0.029*"stress" + 0.029*"pressure." + 0.029*"increased" + 0.029*"that" + 0.029*"cause" + 0.029*"and"'), (1, '0.089*"to" + 0.051*"My" + 0.051*"my" + 0.051*"sister" + 0.051*"consume." + 0.051*"sugar," + 0.051*"Sugar" + 0.051*"father." + 0.051*"bad" + 0.051*"but"'), (2, '0.064*"driving" + 0.037*"around" + 0.037*"dance" + 0.037*"practice." + 0.037*"time" + 0.037*"a" + 0.037*"lot" + 0.037*"of" + 0.037*"father" + 0.037*"spends"')]
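
The topics above are dominated by stopwords and punctuation because the documents were only split, never cleaned. A minimal pre-cleaning sketch (assuming the NLTK stopwords corpus is available) to run before rebuilding the dictionary and document-term matrix:

from nltk.corpus import stopwords
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)

def clean(doc):
    # lowercase, drop punctuation characters, then drop stopwords
    no_punct = ''.join(ch for ch in doc.lower() if ch not in exclude)
    return [word for word in no_punct.split() if word not in stop]

doc_clean = [clean(doc) for doc in doc_complete]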

In [49]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

In [50]:
generate_ngrams('this is a sample text', 2)


Out[50]:
[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
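
NLTK ships an equivalent utility that yields tuples rather than lists; a quick check:

from nltk import ngrams
list(ngrams('this is a sample text'.split(), 2))
# [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]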

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
obj = TfidfVectorizer()

In [53]:
corpus = ['This is sample document.', 'another random document.', 'third sample document text']

In [54]:
X = obj.fit_transform(corpus)

In [55]:
print (X)


  (0, 7)	0.58448290102
  (0, 2)	0.58448290102
  (0, 4)	0.444514311537
  (0, 1)	0.345205016865
  (1, 1)	0.385371627466
  (1, 0)	0.652490884513
  (1, 3)	0.652490884513
  (2, 4)	0.444514311537
  (2, 1)	0.345205016865
  (2, 6)	0.58448290102
  (2, 5)	0.58448290102
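
Each line above is a (document, feature-index) pair with its TF-IDF weight. The index-to-term mapping is stored in the fitted vectorizer's vocabulary (dictionary ordering may vary when printed):

print(obj.vocabulary_)
# {'another': 0, 'document': 1, 'is': 2, 'random': 3, 'sample': 4, 'text': 5, 'third': 6, 'this': 7}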

In [56]:
from gensim.models import Word2Vec

In [57]:
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

In [58]:
# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)


WARNING:gensim.models.word2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

In [60]:
print (model.similarity('data', 'science'))


0.098663955684

In [61]:
print (model['learning'])


[  1.08607556e-03   4.62277047e-03   2.58435309e-03  -4.26230673e-03
   4.32864809e-03  -4.04330960e-04  -1.75475678e-03  -3.26948427e-03
  -4.35007038e-03  -9.38271580e-04  -2.72817072e-03   3.39866313e-03
  -4.28924803e-03   2.53001228e-03  -1.47502718e-03  -4.54866001e-03
  -1.19755440e-03   1.36745919e-03  -4.99364780e-03   4.39920370e-03
  -8.78889696e-04   2.55907397e-03  -4.47233114e-03  -2.98093841e-03
  -3.04079871e-03   3.77006200e-03  -5.06169279e-04  -1.58164476e-04
  -4.97120013e-03   4.11883416e-03  -1.16382574e-03   3.81740881e-03
  -1.49161392e-03  -4.03360883e-03   3.22279660e-03  -2.94679590e-03
   3.02863074e-03  -3.42801865e-03  -3.52651492e-04  -3.85172991e-03
  -2.11770809e-03  -4.80807154e-03   1.37284151e-04   2.76812771e-03
  -3.94002767e-03  -3.65456362e-04  -1.79178803e-03  -3.28000169e-03
  -1.05990539e-03   2.68064812e-03   8.77506754e-05  -2.99095735e-03
   1.89492374e-03   4.13068919e-05  -5.89237607e-04  -6.49927882e-04
   2.57901847e-03  -7.03117403e-04   4.20667045e-03  -2.76946439e-03
   2.95516499e-03  -2.25505629e-03  -1.83734728e-03   2.27217446e-03
   4.41791257e-03   2.84861424e-04   3.51103576e-04   3.54659790e-03
  -2.52496474e-03   4.43121325e-03  -1.20758770e-04   8.45276692e-04
   4.18963004e-03  -2.65925453e-04   4.44951747e-03  -3.20018292e-03
   3.99162294e-03   3.94796021e-04  -1.47228176e-03   1.47388573e-03
   1.93689705e-03   2.78494786e-04  -3.44451889e-03   4.34742076e-03
   3.53631843e-03   1.57816766e-03  -3.53800104e-04   1.87509082e-04
  -2.43617431e-03   1.70787866e-03  -4.31234203e-03  -4.08355193e-03
  -4.45934990e-03   2.58488394e-03  -2.31626094e-03  -1.79338094e-03
   3.77377262e-03  -4.37695207e-03   9.26580222e-04  -3.43537773e-03]
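
Individual vectors are rarely inspected by eye; nearest-neighbour queries are the more common use. A quick check (with a corpus this tiny the similarities are essentially noise, and in newer gensim releases these calls live under model.wv):

print(model.most_similar('data'))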

In [64]:
from textblob.classifiers import NaiveBayesClassifier as NBC

In [65]:
from textblob import TextBlob

In [66]:
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]

In [67]:
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'),
                ('I do not enjoy my job', 'Class_B')]

In [68]:
model = NBC(training_corpus)

In [69]:
print((model.classify("Their codes are amazing.")))


Class_A

In [70]:
print((model.classify("I don't like their computer.")))


Class_B

In [71]:
print((model.accuracy(test_corpus)))


0.8333333333333334


In [80]:
#import TfidfVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

In [93]:
#import classification_report
from sklearn import metrics
from sklearn.metrics import classification_report

In [76]:
from sklearn import svm


In [83]:
# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

In [84]:
# Create feature vectors 
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

In [85]:
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)

In [86]:
# Apply model on test data 
test_vectors = vectorizer.transform(test_data)

In [87]:
# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear')

In [88]:
model.fit(train_vectors, train_labels)


Out[88]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [89]:
prediction = model.predict(test_vectors)

In [94]:
print ((classification_report(test_labels, prediction)))


             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6
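
Text matching: the Levenshtein (edit) distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The implementation below builds the distance row by row.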


In [95]:
def levenshtein(s1, s2):
    # ensure s1 is the shorter string
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances holds edit distances of s1 prefixes against the part of s2 processed so far
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

In [96]:
print(levenshtein("analyze","analyse"))


1

Phonetic matching: Soundex codes a word by how it sounds, so words that are spelled differently but pronounced alike (e.g. "ankit" and "aunkit") receive the same four-character code. This requires the fuzzy package installed at the top of this notebook.


In [100]:
import fuzzy

In [98]:
soundex = fuzzy.Soundex(4)

In [99]:
print (soundex('ankit'))


A523

In [ ]:
print (soundex('aunkit'))

In [ ]:
import math

In [ ]:
from collections import Counter
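
Cosine similarity treats each text as a term-frequency vector and scores the pair by the cosine of the angle between the vectors: the dot product divided by the product of the magnitudes. Identical texts score 1; texts with no words in common score 0.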

In [ ]:
def get_cosine(vec1, vec2):
    # dot product over the terms common to both vectors
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    # product of the two vector magnitudes
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

In [ ]:
def text_to_vector(text): 
    words = text.split() 
    return Counter(words)

In [101]:
text1 = 'This is an article on analytics vidhya' 
text2 = 'article on analytics vidhya is about natural language processing'

In [ ]:
vector1 = text_to_vector(text1)

In [ ]:
vector2 = text_to_vector(text2)

In [ ]:
cosine = get_cosine(vector1, vector2)
print(cosine)